Mini Project

We are going to use the Enron dataset again, this time to learn about regressions.

We will try to infer an employee's "bonus" (the target) from their "salary" (the input).

As always, ensure a Python 2 environment first (otherwise, modify the code per Udacity's instructions to run under Python 3.x).

In [6]:
# ensuring python version
import sys
sys.version
sys.version_info
Out[6]:
sys.version_info(major=2, minor=7, micro=15, releaselevel='final', serial=0)
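
If you need to run this under Python 3 instead, the usual adaptations (a sketch of the standard Python 2 to 3 changes, not Udacity's exact migration notes) look like this:

# Python 3 equivalents of the Python 2 idioms used in this notebook
import pickle
# sklearn.cross_validation was removed from newer scikit-learn; use model_selection
from sklearn.model_selection import train_test_split

# pickle files must be opened in binary mode under Python 3
with open("../17. Final Project/final_project_dataset_modified.pkl", "rb") as f:
    dictionary = pickle.load(f)

# print is a function in Python 3: print(reg.coef_) instead of print reg.coef_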

Visualizing the Regression Data

First, ensure that the skeleton script from Udacity runs fine.

In [7]:
#!/usr/bin/python

"""
    Starter code for the regression mini-project.
    
    Loads up/formats a modified version of the dataset
    (why modified?  we've removed some trouble points
    that you'll find yourself removing in the outliers mini-project).
    Draws a little scatterplot of the training/testing data
    You fill in the regression code where indicated:
"""    

%matplotlib inline 
import sys
import pickle
#sys.path.append("../../tools/")
from feature_format import featureFormat, targetFeatureSplit 
dictionary = pickle.load( open("../17. Final Project/final_project_dataset_modified.pkl", "r") )

### list the features you want to look at--first item in the 
### list will be the "target" feature
features_list = ["bonus", "salary"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )

### training-testing split needed in regression, just like classification
from sklearn.cross_validation import train_test_split
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"



### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and 
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.



### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color ) 
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color ) 

### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")




### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()

Slope and Intercept

Import LinearRegression from sklearn, and create/fit your regression. Name it reg so that the plotting code will show it overlaid on the scatterplot. Does it fall approximately where you expected it?

Extract the slope (stored in the reg.coef_ attribute) and the intercept. What are the slope and intercept?
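
Before running it on the Enron data, here is a minimal standalone sketch of the LinearRegression API on toy numbers (both the data and the toy_reg name are made up for illustration):

from sklearn.linear_model import LinearRegression

# sklearn expects a 2-D array of inputs: (n_samples, n_features)
X = [[1.0], [2.0], [3.0], [4.0]]
y = [3.1, 4.9, 7.2, 8.8]  # roughly y = 2x + 1

toy_reg = LinearRegression()
toy_reg.fit(X, y)
print toy_reg.coef_       # slope(s), one entry per input feature
print toy_reg.intercept_  # intercept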

In [10]:
#!/usr/bin/python

"""
    Starter code for the regression mini-project.
    
    Loads up/formats a modified version of the dataset
    (why modified?  we've removed some trouble points
    that you'll find yourself removing in the outliers mini-project).
    Draws a little scatterplot of the training/testing data
    You fill in the regression code where indicated:
"""    

%matplotlib inline 
import sys
import pickle
#sys.path.append("../../tools/")
from feature_format import featureFormat, targetFeatureSplit 
dictionary = pickle.load( open("../17. Final Project/final_project_dataset_modified.pkl", "r") )

### list the features you want to look at--first item in the 
### list will be the "target" feature
features_list = ["bonus", "salary"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )

### training-testing split needed in regression, just like classification
from sklearn.cross_validation import train_test_split
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"



### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and 
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train, target_train)
print reg.coef_
print reg.intercept_


### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color ) 
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color ) 

### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")




### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()
[5.44814029]
-102360.54329388007

The slope is about 5.45 and the intercept is about -102,360.5.
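
To sanity-check these numbers, we can plug a hypothetical salary into the fitted line (the $200,000 figure is made up, not from the dataset):

# predicted bonus for a hypothetical $200,000 salary
print reg.predict([[200000.0]])  # = 5.448 * 200000 - 102360.5, roughly 987,267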

Regression Score: Training Data

Imagine you were a less savvy machine learner, and didn't know to test on a holdout test set. Instead, you tested on the same data that you used to train, by comparing the regression predictions to the target values (i.e. bonuses) in the training data. What score do you find?

In [11]:
from sklearn.metrics import r2_score

# predicting the 'bonus' from 'training inputs/salaries'
target_predictions = reg.predict(feature_train)

# comparing predicted 'bonus' with 'training' bonus
score = r2_score(target_train, target_predictions)  
score
Out[11]:
0.04550919269952436
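
As an aside, LinearRegression also has a built-in score method that returns the same R^2, so the r2_score call above is equivalent to this quick check on the reg fit earlier:

# score() computes R^2 of the predictions directly
print reg.score(feature_train, target_train)  # should match the value above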

Regression Score: Test Data

Now compute the score for your regression on the test data, like you know you should. What's that score on the testing data?

In [12]:
from sklearn.metrics import r2_score

# predicting the 'bonus' from 'TEST inputs/salaries'
target_predictions = reg.predict(feature_test)

# comparing predicted 'bonus' with 'TEST' bonus
score = r2_score(target_test, target_predictions)  
score
Out[12]:
-1.484992417368511
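
A negative R^2 looks odd, but it simply means the model does worse on the test set than the trivial baseline of always predicting the mean of the test targets, since R^2 = 1 - SS_res / SS_tot. Here is a small numeric sketch of that formula (my own toy numbers, not project data):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([3.0, 3.0, 3.0])  # a deliberately bad prediction

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
print 1 - ss_res / ss_tot       # -1.5, the same value as...
print r2_score(y_true, y_pred)  # ...sklearn's r2_score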

Regressing Bonus Against LTI

Regress the bonus against the long term incentive, and see if the regression score is significantly higher than regressing the bonus against the salary. Perform the regression of bonus against long term incentive--what's the score on the test data?

Step 1: Redo the regression, this time with LTI. Changes are noted with the comment 'CHANGED HERE'.

In [14]:
#!/usr/bin/python

"""
    Starter code for the regression mini-project.
    
    Loads up/formats a modified version of the dataset
    (why modified?  we've removed some trouble points
    that you'll find yourself removing in the outliers mini-project).
    Draws a little scatterplot of the training/testing data
    You fill in the regression code where indicated:
"""    

%matplotlib inline 
import sys
import pickle
#sys.path.append("../../tools/")
from feature_format import featureFormat, targetFeatureSplit 
dictionary = pickle.load( open("../17. Final Project/final_project_dataset_modified.pkl", "r") )

### list the features you want to look at--first item in the 
### list will be the "target" feature
features_list = ["bonus", "long_term_incentive"]  #CHANGED HERE
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )

### training-testing split needed in regression, just like classification
from sklearn.cross_validation import train_test_split
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"



### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and 
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train, target_train)
print reg.coef_
print reg.intercept_


### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color ) 
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color ) 

### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")




### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()
[1.19214699]
554478.7562150091

Step 2: Predict and calculate the score using the test data:

In [15]:
from sklearn.metrics import r2_score

# predicting the 'bonus' from 'TEST inputs/long term incentives'
target_predictions = reg.predict(feature_test)

# comparing predicted 'bonus' with 'TEST' bonus
score = r2_score(target_test, target_predictions)  
score
Out[15]:
-0.5927128999498643

At about -0.59, the test score with long-term incentive is higher (less negative) than the -1.48 we got with salary, so LTI looks like the better predictor of bonus here, though neither model fits well.

Sneak Peek: Outliers Break Regressions

Add these two lines near the bottom of finance_regression.py, right before plt.xlabel(features_list[1]):

reg.fit(feature_test, target_test)
plt.plot(feature_train, reg.predict(feature_train), color="b")

(The brightness of the training and test data points has been reduced here to make the regression lines easier to see.)

In [17]:
#!/usr/bin/python

"""
    Starter code for the regression mini-project.
    
    Loads up/formats a modified version of the dataset
    (why modified?  we've removed some trouble points
    that you'll find yourself removing in the outliers mini-project).
    Draws a little scatterplot of the training/testing data
    You fill in the regression code where indicated:
"""    

%matplotlib inline 
import sys
import pickle
#sys.path.append("../../tools/")
from feature_format import featureFormat, targetFeatureSplit 
dictionary = pickle.load( open("../17. Final Project/final_project_dataset_modified.pkl", "r") )

### list the features you want to look at--first item in the 
### list will be the "target" feature
features_list = ["bonus", "salary"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )

### training-testing split needed in regression, just like classification
from sklearn.cross_validation import train_test_split
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = '#BBDEFB' #"b"
test_color = '#FFCDD2'#"r"



### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and 
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train, target_train)
print reg.coef_
print reg.intercept_


### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color ) 
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color ) 

### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")


### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test))
except NameError:
    pass

# OUTLIERS
reg.fit(feature_test, target_test)
plt.plot(feature_train, reg.predict(feature_train), color="r")

plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()
[5.44814029]
-102360.54329388007

What is the slope of the new regression line?

In [18]:
print reg.coef_
[2.27410114]

Refit on the test data, the slope drops from about 5.45 to about 2.27. The two halves of the same dataset give noticeably different lines, which is the point of this sneak peek: outliers can substantially change a regression fit, as the outliers mini-project will show.